Using graph-based consensus clustering for combining K-means clustering of heterogeneous chemical structures
Authors
Abstract
Consensus clustering methods are motivated by the success of combining multiple classifiers in many areas. In this paper, graph-based consensus clustering is used to improve the quality of chemical compound clustering by enhancing the robustness, novelty, consistency and stability of individual clusterings. For this purpose, the HyperGraph Partitioning Algorithm (HGPA) [1] was applied. The clustering was evaluated on its ability to separate active from inactive molecules in each cluster, and the results were compared with Ward's clustering method. The experiments used the MDL Drug Data Report (MDDR) database, which consists of 102516 molecules. From this database, the dataset DS1 was chosen; it has been used in many virtual screening experiments [2-4] and contains 10 heterogeneous activity classes (8568 molecules). Two 2D fingerprint descriptors developed by SciTegic's Pipeline Pilot [5] were used for the clustering experiments: 120-bit ALOGP and 1024-bit extended connectivity fingerprints (ECFP_4). The results were evaluated by how effectively each method separates active from inactive molecules, using the quality partition index (QPI) measure devised by Varin et al. [6]. Following [7], an active cluster is defined as a non-singleton cluster in which the percentage of active molecules is greater than the percentage of active molecules in the dataset as a whole. Let p be the number of actives in active clusters, q the number of inactives in active clusters, r the number of actives in inactive clusters (i.e., clusters that are not active clusters) and s the number of singleton actives. The quality partition index is then defined as QPI = p / (p + q + r + s). A high value occurs when the actives are clustered tightly together and separated from the inactive molecules.
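The QPI evaluation described in the abstract can be illustrated with a minimal Python sketch. It assumes the usual form QPI = p / (p + q + r + s) and takes one cluster label and one active/inactive flag per molecule. The function name `quality_partition_index` and its input layout are hypothetical, chosen for illustration and not taken from the paper. Treating clusters whose active fraction equals the dataset fraction as inactive, and counting only singleton actives in s, are assumptions read from the definitions above.

```python
from collections import Counter


def quality_partition_index(labels, is_active):
    """Return QPI = p / (p + q + r + s) for one clustering.

    labels    : cluster label for each molecule
    is_active : True/False activity flag for each molecule
    (Hypothetical helper; the input layout is an assumption.)
    """
    if len(labels) != len(is_active):
        raise ValueError("labels and is_active must have the same length")
    n = len(labels)
    if n == 0:
        return 0.0

    # Fraction of actives in the dataset as a whole.
    overall_fraction = sum(1 for a in is_active if a) / n

    cluster_sizes = Counter(labels)
    actives_per_cluster = Counter(
        label for label, active in zip(labels, is_active) if active
    )

    p = q = r = s = 0
    for cluster, size in cluster_sizes.items():
        n_actives = actives_per_cluster.get(cluster, 0)
        if size == 1:
            # Singleton cluster: only its active (if any) counts, towards s.
            s += n_actives
        elif n_actives / size > overall_fraction:
            # Active cluster: actives go to p, inactives to q.
            p += n_actives
            q += size - n_actives
        else:
            # Inactive (non-singleton) cluster: only its actives count, towards r.
            r += n_actives

    denominator = p + q + r + s
    return p / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy data: three clusters over seven molecules, four of them active.
    labels = [0, 0, 0, 1, 1, 1, 2]
    is_active = [True, True, False, False, False, True, True]
    print(quality_partition_index(labels, is_active))
```

In the toy data, cluster 0 is active (two of three molecules are active, above the dataset fraction of 4/7), cluster 1 is inactive and cluster 2 is a singleton active, so p = 2, q = 1, r = 1 and s = 1.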
Similar resources
Selecting ensemble members in clustering ensembles using voting
Clustering is the process of dividing a dataset into subsets called clusters, so that objects within a cluster are similar to each other and different from objects in other clusters. So far, many algorithms based on different approaches have been created for clustering. An effective option is to combine two or more of these algorithms to solve the clustering problem. Ensemb...
GROUND MOTION CLUSTERING BY A HYBRID K-MEANS AND COLLIDING BODIES OPTIMIZATION
The stochastic nature of earthquakes makes it challenging for engineers to choose which records to use in their analyses. Clustering is offered as a solution to this data mining problem, automatically distinguishing between ground motion records based on similarities in the corresponding seismic attributes. The present work formulates an optimization problem to seek the best clustering measures. In...
Clustering of nasopharyngeal carcinoma intensity modulated radiation therapy plans based on k-means algorithm and geometrical features
Background: The design of intensity modulated radiation therapy (IMRT) plans is difficult and time-consuming. Retrieving similar IMRT plans from an IMRT plan dataset can effectively improve the quality and efficiency of IMRT plans and automate the design of IMRT planning. However, large IMRT plan datasets make retrieval inefficient. Materials and Methods: An intensity-m...
Entropy-based Consensus for Distributed Data Clustering
The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...
A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)
Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using metaheuristic algorithms. In this study, we propose a simulated annealing based algorithm for K-means in clustering analysis which we refer it a...
Journal title:
Volume 5, issue
Pages -
Publication date: 2013